PROBLEMS AND RESULTS


Research  under  this  contract  focused  on  three  basic  problem 
areas: (1) "critical data analysis" (whose aim is to provide
more formal inferences accompanying techniques of exploratory
data  analysis),  (2)  distribution  shapes  that  tend  to  arise  in 
real  data,  and  (3)  computer  implementation  of  some  advanced 
techniques  in  exploratory  data  analysis. 

Critical  Data  Analysis 

Topics  of  interest  for  our  work  on  critical  data  analysis 
included robust/resistant methods in analysis of variance,
properties  of  robust/resistant  methods  as  compared  to 
nonparametric  methods,  multiplicity  and  simultaneous  confidence, 
and  robust  estimation  in  nonsymmetric  situations. 

During  the  period  of  the  contract,  our  research  concentrated 
primarily  on  analysis  of  variance  and  related  models. 

In  analyzing  data  that  take  the  form  of  a  two-way  layout 
it  is  often  helpful  to  consider  models  that  involve  both 
additive  and  multiplicative  terms.  One  common  model  of  this 
type decomposes y_{ij}, the value of the response variable in
row i and column j, according to

    y_{ij} = \mu + \alpha_i + \beta_j + \kappa \gamma_i \delta_j + \varepsilon_{ij} ,

where, when one is fitting by least squares,
\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_i = \sum_j \delta_j = 0,
\sum_i \gamma_i^2 = \sum_j \delta_j^2 = 1, and the \varepsilon_{ij} are uncorrelated.
Within  this  framework  we  studied  some  consequences  of  the 
nonresistance inherent in least-squares fitting and investigated
a robust/resistant approach to fitting such additive-plus-
multiplicative models.
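
As an illustration only, the least-squares fit of such a model can be
sketched in Python. The sketch assumes a complete table with one
observation per cell, removes the grand mean and the row and column
means, and then takes the leading singular triple of the additive
residuals as the multiplicative term. The function name and the use of
power iteration (in place of a library singular value decomposition)
are choices made for this sketch, not the report's own software.

```python
import math

def fit_additive_plus_multiplicative(y, iters=500):
    """Least-squares fit of mu + alpha_i + beta_j + kappa*gamma_i*delta_j."""
    rows, cols = len(y), len(y[0])
    mu = sum(sum(row) for row in y) / (rows * cols)
    alpha = [sum(row) / cols - mu for row in y]
    beta = [sum(y[i][j] for i in range(rows)) / rows - mu for j in range(cols)]
    # Additive residuals: every row and every column sums to zero.
    resid = [[y[i][j] - mu - alpha[i] - beta[j] for j in range(cols)]
             for i in range(rows)]

    # Start the power iteration from the residual row with the largest norm.
    start = max(resid, key=lambda row: sum(v * v for v in row))
    norm = math.sqrt(sum(v * v for v in start))
    if norm < 1e-12:  # table is (numerically) purely additive
        return mu, alpha, beta, 0.0, [0.0] * rows, [0.0] * cols
    delta = [v / norm for v in start]

    kappa, gamma = 0.0, [0.0] * rows
    for _ in range(iters):
        gamma = [sum(resid[i][j] * delta[j] for j in range(cols))
                 for i in range(rows)]
        g_norm = math.sqrt(sum(v * v for v in gamma))
        gamma = [v / g_norm for v in gamma]
        delta = [sum(resid[i][j] * gamma[i] for i in range(rows))
                 for j in range(cols)]
        kappa = math.sqrt(sum(v * v for v in delta))  # leading singular value
        delta = [v / kappa for v in delta]
    return mu, alpha, beta, kappa, gamma, delta
```

Because the additive residuals have zero row and column sums, the
resulting gamma and delta automatically sum to zero, and the
normalization gives each unit sum of squares. The signs of gamma and
delta are determined only jointly.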

When  one  uses  least  squares  to  fit  an  additive-plus- 
multiplicative  model,  a  perturbation  of  the  y-value  in  a 
single  cell  can  have  far  greater  impact  on  the  fitted  value 
in  the  same  and  other  cells  than  is  predicted  by  the  formula 
for  leverage  in  the  additive  model 


    y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij} .


Emerson,  Hoaglin,  and  Kempthorne  (1984)  derived  formulas  for 
generalizations  of  leverage  in  particular  cases. 

To get around such difficulties with least-squares fitting,
Emerson  and  Wong  (1985)  further  developed  an  approach  based 
on  the  exploratory  technique  known  as  median  polish.  By 
making  row  and  column  sign  changes  in  the  table  of  additive 
residuals  and  then  transforming  to  a  logarithmic  scale,  this 
approach  produces  resistant  additive-plus-multiplicative  fits 
and  can  also  provide  a  basis  for  fitting  further  multiplicative 
terms.  Hoaglin,  Wong,  and  Emerson  (1983)  used  this  procedure 
as  part  of  a  broad  framework  for  resistant  diagnosis  of 
interaction  in  two-way  layouts.  Also,  Emerson,  Hoaglin,  Tukey, 
and Wong (1985) illustrated this approach to
additive-plus-multiplicative models, together with other resistant
techniques, in reanalyzing a classical set of data on the perceived
favorableness of 15 adjectives when modified by each of 9
adverbs.
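
For readers unfamiliar with median polish, the basic additive step can
be sketched as follows. This is a minimal Python illustration of
ordinary median polish only; it does not include the row and column
sign changes or the logarithmic transformation that the approach
described above applies to the additive residuals. The function name,
iteration limit, and tolerance are assumptions of the sketch.

```python
from statistics import median

def median_polish(table, max_iter=10, tol=1e-9):
    """Return (overall, row_effects, col_effects, residuals)."""
    rows, cols = len(table), len(table[0])
    resid = [[float(v) for v in row] for row in table]
    overall = 0.0
    row_eff = [0.0] * rows
    col_eff = [0.0] * cols
    for _ in range(max_iter):
        change = 0.0
        # Row sweep: move each row's median into the row effects.
        for i in range(rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
            change += abs(m)
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Column sweep: move each column's median into the column effects.
        for j in range(cols):
            m = median(resid[i][j] for i in range(rows))
            col_eff[j] += m
            for i in range(rows):
                resid[i][j] -= m
            change += abs(m)
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
        if change <= tol:  # no median moved anything in this pass
            break
    return overall, row_eff, col_eff, resid
```

At every stage each cell equals the overall value plus its row effect,
column effect, and residual. A single wild cell is left almost entirely
in the residuals, which is the resistance that least squares lacks.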


In an area related to analysis of variance, Kempthorne
(1984), working in part from a Bayesian viewpoint, devised
a  procedure  for  identifying  influential  groups  of  observations 
in  multiple  regression.  By  using  a  direction  search  to  reveal 
"derivative-influential"  data,  it  offers  advantages  (especially 
in  computing  effort)  over  other  methods. 

To  provide  some  inferential  support  for  one  frequently 
used  technique  of  exploratory  data  analysis,  Hoaglin,  Iglewicz, 
and  Tukey  (1985)  carried  out  an  extensive  study  (both 
theoretical  and  empirical)  of  a  class  of  resistant  rules  for 
labeling  possible  outliers  in  univariate  samples.  One  main 
motivation  is  that,  by  using  measures  of  location  and  spread 
that  are  themselves  relatively  insensitive  to  moderate  numbers 
of  sour  observations,  these  rules  can  avoid  most  of  the  problems 
that  many  other  outlier-detection  rules  encounter  when  a  sample 
may  contain  several  outliers.  The  resistant  rules  use  the 
lower fourth F_L and upper fourth F_U (approximate quartiles)
of the sample to set up cutoffs

    F_L - k(F_U - F_L)   and   F_U + k(F_U - F_L)

and  label  as  possible  outliers  any  observations  that  fall 
outside  these  cutoffs.  The  main  rule  used  in  exploratory  data 
analysis  has  k  =  1.5  for  all  sample  sizes  n.  An  important 
aspect  of  a  rule's  performance  is  its  "outside  rate  per 
sample"  (the  probability  that  a  sample  of  n  contains  at  least 
one "outside" observation).  Our work showed that this rule's
outside rate per sample ranges roughly from 15 to 35 percent
in Gaussian samples of size 5 to 50 and generally increases with
n.  Another  characteristic,  the  outside  rate  per  observation, 
is  roughly  2  to  5  percent  around  n  =  10  and  decreases  as  1/n 
to  a  Gaussian  asymptotic  value  of  0.7  percent.  This  finding 
is  of  considerable  interest,  because  the  outside  rate  per 
observation  is  much  higher  in  small  to  moderate  samples  than 
intuition,  based  primarily  on  the  population  value,  had 
suggested.  Hoaglin,  Iglewicz,  and  Tukey  also  developed  (1) 
a  very  good  theoretical  approximation  for  the  outside  rate 
per  Gaussian  sample  that  applies  to  many  rules  of  the  above 
form  and  (2)  a  satisfactory  approximation,  based  on  the  ratio 
of  independent  linear  combinations  of  independent  exponential 
variates,  for  the  outside  rate  per  sample  in  a  class  of 
heavier-tailed  distributions. 
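
The labeling rule and a crude simulation of its outside rate per sample
can be sketched in Python. The sketch assumes the usual letter-value
definition of the fourths (depth equal to half of floor((n + 1)/2) + 1);
the function names, the number of trials, and the random seed are
illustrative choices. The simulation is a plain Monte Carlo estimate,
not the theoretical approximation developed in the study.

```python
import random

def fourths(data):
    """Lower and upper fourths, using depth (floor((n + 1) / 2) + 1) / 2."""
    x = sorted(data)
    n = len(x)
    depth = ((n + 1) // 2 + 1) / 2
    lo = int(depth)                      # floor of the depth
    hi = lo if depth == lo else lo + 1   # ceiling of the depth
    lower = (x[lo - 1] + x[hi - 1]) / 2
    upper = (x[n - lo] + x[n - hi]) / 2
    return lower, upper

def outside_values(data, k=1.5):
    """Observations beyond F_L - k(F_U - F_L) or F_U + k(F_U - F_L)."""
    f_l, f_u = fourths(data)
    spread = f_u - f_l
    low_cut, high_cut = f_l - k * spread, f_u + k * spread
    return [v for v in data if v < low_cut or v > high_cut]

def outside_rate_per_sample(n, trials=2000, k=1.5, seed=0):
    """Fraction of simulated Gaussian samples with at least one outside value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if outside_values(sample, k):
            hits += 1
    return hits / trials
```

For example, outside_values([1, 2, 3, 4, 5, 6, 7, 8, 100]) has fourths
3 and 7, cutoffs -3 and 13, and labels only the value 100.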

Distribution  Shape 

One  major  objective  of  the  research  on  distribution  shape 
is  a  better  understanding  of  the  variety  and  characteristics 
of  distributions  that  arise  in  actual  data,  in  part  as  a  basis 
for  judging  the  degree  of  robustness  that  various  statistical 
analyses  might  require  in  practice.  As  a  framework  for 
studying  these  questions  in  continuous  data,  we  have  used 
Tukey's family of g-and-h distributions, which permit more
resistant  estimates  of  shape  parameters  and  offer  greater 
flexibility  than  the  traditional  third  and  fourth  moments. 
In the g-and-h distributions the parameter g controls skewness
(departure  from  symmetry),  and  the  parameter  h  controls 
elongation (heavier tails).  The basic random variable Y (to
which location and scale parameters can be applied) is given,
in terms of a standard Gaussian random variable Z, by

    Y = g^{-1} (e^{gZ} - 1) e^{hZ^2/2} .

Estimation  of  g  and  h  customarily  begins  with  sample  quantiles. 
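
A short Python sketch may make the role of the quantiles concrete. It
evaluates the g-and-h transformation and recovers g from three
quantiles, using the fact (which follows algebraically from the
definition of Y) that the ratio of the upper half-spread to the lower
half-spread at tail area p equals exp(-g z_p), whatever the value of h.
The function names are illustrative, and whether this particular
quantile-based estimate matches the estimators studied under the
contract is an assumption of the sketch.

```python
import math
from statistics import NormalDist

def g_and_h(z, g, h):
    """Value of Y for a standard Gaussian value z (g = 0 gives z*exp(h z^2/2))."""
    core = z if g == 0 else math.expm1(g * z) / g
    return core * math.exp(h * z * z / 2.0)

def estimate_g(q_lower, q_median, q_upper, p):
    """Estimate g from the quantiles at p, 0.5, and 1 - p, with 0 < p < 0.5.

    (q_upper - q_median) / (q_median - q_lower) = exp(-g * z_p),
    where z_p is the (negative) standard Gaussian quantile at p.
    """
    z_p = NormalDist().inv_cdf(p)
    return -math.log((q_upper - q_median) / (q_median - q_lower)) / z_p
```

Location and scale cancel in the ratio, so the same function applies to
A + B*Y. With sample quantiles in place of exact ones, the estimate
varies with p, and the values at several p are usually examined
together.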

Because  the  behavior  of  the  resistant  estimators  of  g 
and  h  had  received  little  attention,  Godfrey  (1985)  studied 
them  in  a  variety  of  situations  in  which  the  data  come  from 
known theoretical distributions, including Gaussian, lognormal,
Student's  t  with  small  degrees  of  freedom,  and  contaminated 
Gaussian.  The  sample  sizes  were  100,  200,  500,  and  1000. 

She  found  that  simple  resistant  estimators  of  g  and  h  have 
distributions  very  close  to  Gaussian,  appear  to  be  unbiased, 
and  have  variances  in  good  agreement  with  the  values  predicted 
by  the  asymptotic  formulas  that  she  derived.  The  fact  that 
these  variances  are  not  especially  small  confirms  the  belief 
that  one  needs  samples  of  at  least  several  hundred  observations 
to  learn  much  about  distribution  shape.  Her  results  also 
suggest  that  those  resistant  estimators  of  g  and  h  are 
substantially  more  variable  than  the  corresponding  maximum- 
likelihood  estimators. 

To illustrate the analysis of distribution shape, as well
as to prepare a procedure for later, more routine studies,
Godfrey applied the g-and-h techniques to several moderate
to  large  published  data  sets,  ranging  in  size  from  100  to 
1500  observations.  Godfrey,  Hoaglin,  and  Mosteller  began 
an  ongoing  program  of  collecting  sizable  samples,  frequency 
distributions,  and  data  sets  from  a  variety  of  sources. 

The  g-and-h  distributions  are  also  valuable  in  approximating 
quantiles  of  non-Gaussian  theoretical  distributions.  Godfrey 
used  this  approach  to  obtain  good  new  approximations  for  the 
quantiles  of  the  chi-squared  and  t  distributions. 

In  other  work  related  to  the  g-and-h  distributions,  Hoaglin 
(1985)  described  a  method  for  fitting  these  distributions 
to  binned  frequency  distributions. 

Although  discrete  distributions  pose  rather  different 
problems  for  description  of  shape,  several  of  the  most  common 
families  (including  Poisson,  binomial,  and  negative  binomial) 
are  amenable  to  flexible  resistant  checking.  Hoaglin  and  Tukey 
(1985)  substantially  improved  the  Poissonness  plot  and 
developed  several  new  techniques  for  checking  the  shape  of 
discrete  frequency  distributions. 
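
The idea behind the basic Poissonness plot can be sketched in Python.
If n_k observations out of N equal k and the data are Poisson with mean
lambda, then ln(k! n_k / N) is approximately -lambda + k ln(lambda), so
the points should lie near a straight line with slope ln(lambda). The
sketch below covers only this basic form, not the improved version or
the other new techniques; the function names are illustrative, and the
ordinary least-squares line is used purely for brevity (a resistant
line could be substituted).

```python
import math

def poissonness_points(freq):
    """freq maps each count k to its observed frequency n_k.

    Returns (k, ln(k! * n_k / N)) for every k with n_k > 0.
    """
    total = sum(freq.values())
    return [(k, math.lgamma(k + 1) + math.log(n_k / total))
            for k, n_k in sorted(freq.items()) if n_k > 0]

def fit_line(points):
    """Ordinary least-squares slope and intercept through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

Given a frequency table, math.exp(slope) from fit_line gives a rough
value of lambda, and systematic curvature in the points suggests a
departure from the Poisson shape.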

Software 

Work  in  this  area  was  designed  to  make  selected  advanced 
techniques  of  exploratory  data  analysis  more  readily  accessible 
for  application  by  implementing  them  in  Fortran.  One  major 
product  was  a  set  of  subroutines  that  provide  almost  all  the 
new  techniques  for  diagnosing  discrete  frequency  distributions 
described  by  Hoaglin  and  Tukey  (1985). 


Also,  other  aspects  of  the  overall  research  led  to  the 
development  of  related  software,  designed  for  more  than  casual 
internal  use.  The  work  on  outlier  labeling,  for  example, 
produced  an  algorithm  for  evaluating  the  cumulative  distribution 
function  of  the  ratio  of  two  independent  linear  combinations 
of  independent  exponential  random  variables.  In  addition, 
some  software  took  the  form  of  macros  for  the  Minitab  statistical 
system.  These  included  the  singular  value  decomposition 
(primarily  for  fitting  additive-plus-multiplicative  models 
by  least  squares),  the  biweight  location  estimator,  and 
biweight  polish  for  two-way  tables. 
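
As an indication of what the biweight location estimator computes, a
minimal Python sketch follows. It is not a translation of the Minitab
macro. It assumes the common iteratively reweighted form, with the
median absolute deviation (computed once, about the median) as the
scale and a tuning constant c = 6; these choices, the function name,
and the stopping rule are assumptions of the sketch.

```python
from statistics import median

def biweight_location(data, c=6.0, tol=1e-10, max_iter=50):
    """Biweight location estimate for a list of numbers."""
    center = median(data)
    mad = median(abs(x - center) for x in data)
    if mad == 0:  # more than half the values coincide; fall back to the median
        return center
    for _ in range(max_iter):
        num = den = 0.0
        for x in data:
            u = (x - center) / (c * mad)
            if abs(u) < 1:  # values with |u| >= 1 receive zero weight
                w = (1 - u * u) ** 2
                num += w * x
                den += w
        new_center = num / den
        if abs(new_center - center) <= tol:
            return new_center
        center = new_center
    return center
```

For the values 1 through 8 together with a stray 100, the stray value
receives zero weight and the estimate settles at 4.5, the center of the
remaining eight values.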